Before you start

Set the seed

# Any number can be chosen 
set.seed(238428)

Testing timing of Rmd for server evaluation

# What time did we start running this script? 
start_time <- Sys.time()
start_time
## [1] "2024-03-09 14:31:08 EST"

Goals of this file

NOTE: This document was NOT run during class. Instead, it was run as an example to navigate the issues with the server crashing during the dada2 workflow.

  1. Use raw fastq files and generate quality plots to assess quality of reads.
  2. Filter and trim out bad sequences and bases from our sequencing files.
  3. Write out fastq files with high quality sequences.
  4. Evaluate the quality from our filter and trim.
  5. Infer Errors on forward and reverse reads individually.
  6. Infer ASVs on forward and reverse reads separately, using the error model.
  7. Merge forward and reverse ASVs into “contiguous ASVs”.
  8. Generate the ASV count table (the otu_table input for phyloseq).

Output that we need:

  1. ASV Count Table: otu_table
  2. Taxonomy Table tax_table
  3. Sample Information: sample_data, to track read loss throughout the DADA2 workflow.
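The read-tracking table in item 3 can be sketched in base R. This is a minimal illustration (the `track` object and its columns are hypothetical, not part of the dada2 API); the counts used here are the first two samples reported by filterAndTrim() later in this file.

```r
# A minimal sketch of a read-tracking table with two stages (raw and filtered);
# more columns (denoised, merged, non-chimeric) would be added as the workflow runs.
track <- data.frame(
  input     = c(8722, 5156),
  filtered  = c(8023, 4632),
  row.names = c("022um-Control", "20210602-MA-ABB1F"))

# Percent of reads surviving each sample's filtering step: 92.0 and 89.8
track$percent_retained <- round(100 * track$filtered / track$input, 1)
track
```

A table like this makes it easy to spot a sample that loses a suspiciously large fraction of its reads at any single step.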

Load Libraries

# Efficient package loading with pacman 
# Don't forget to install pacman and DT if you don't have it yet. :) 
pacman::p_load(tidyverse, BiocManager, devtools, dada2, 
               phyloseq, patchwork, DT, iNEXT, vegan,
               install = FALSE)

Load Data

# Set the raw fastq path to the raw sequencing files 
# Path to the fastq files 
raw_fastqs_path <- "data/01_DADA2/01_raw_gzipped_fastqs"
raw_fastqs_path
## [1] "data/01_DADA2/01_raw_gzipped_fastqs"
# What files are in this path? Intuition Check 
head(list.files(raw_fastqs_path))
## [1] "022um-Control_S36_L001_R1_001.fastq.gz"    
## [2] "022um-Control_S36_L001_R2_001.fastq.gz"    
## [3] "20210602-MA-ABB1F_S31_L001_R1_001.fastq.gz"
## [4] "20210602-MA-ABB1F_S31_L001_R2_001.fastq.gz"
## [5] "20210602-MA-ABB1P_S28_L001_R1_001.fastq.gz"
## [6] "20210602-MA-ABB1P_S28_L001_R2_001.fastq.gz"
# How many files are there? 
str(list.files(raw_fastqs_path))
##  chr [1:190] "022um-Control_S36_L001_R1_001.fastq.gz" ...
# Create vector of forward reads
forward_reads <- list.files(raw_fastqs_path, pattern = "R1_001.fastq.gz", full.names = TRUE)  
# Intuition Check 
head(forward_reads)  
## [1] "data/01_DADA2/01_raw_gzipped_fastqs/022um-Control_S36_L001_R1_001.fastq.gz"    
## [2] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABB1F_S31_L001_R1_001.fastq.gz"
## [3] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABB1P_S28_L001_R1_001.fastq.gz"
## [4] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABB1W_S58_L001_R1_001.fastq.gz"
## [5] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABS1F_S45_L001_R1_001.fastq.gz"
## [6] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABS1P_S65_L001_R1_001.fastq.gz"
# Create a vector of reverse reads 
reverse_reads <- list.files(raw_fastqs_path, pattern = "R2_001.fastq.gz", full.names = TRUE)
head(reverse_reads)
## [1] "data/01_DADA2/01_raw_gzipped_fastqs/022um-Control_S36_L001_R2_001.fastq.gz"    
## [2] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABB1F_S31_L001_R2_001.fastq.gz"
## [3] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABB1P_S28_L001_R2_001.fastq.gz"
## [4] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABB1W_S58_L001_R2_001.fastq.gz"
## [5] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABS1F_S45_L001_R2_001.fastq.gz"
## [6] "data/01_DADA2/01_raw_gzipped_fastqs/20210602-MA-ABS1P_S65_L001_R2_001.fastq.gz"

Assess Raw Read Quality

Evaluate raw sequence quality

Let’s see the quality of the raw reads before we trim.

Plot the quality profiles of 12 random samples

# Randomly select 12 samples from dataset to evaluate 
# Selecting 12 is typically better than 2 (like we did in class for efficiency)
random_samples <- sample(1:length(reverse_reads), size = 12)
random_samples
##  [1] 59 45 49 80 40 68  8 43 89  5 55 67
# Calculate and plot the quality of these 12 samples
forward_filteredQual_plot_12 <- plotQualityProfile(forward_reads[random_samples]) + 
  labs(title = "Forward Read: Raw Quality")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the dada2 package.
##   Please report the issue at <https://github.com/benjjneb/dada2/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
reverse_filteredQual_plot_12 <- plotQualityProfile(reverse_reads[random_samples]) + 
  labs(title = "Reverse Read: Raw Quality")


# Plot them together with patchwork
forward_filteredQual_plot_12 + reverse_filteredQual_plot_12

Aggregated Raw Quality Plots

# Aggregate all QC plots 
# Forward reads
forward_preQC_plot <- 
  plotQualityProfile(forward_reads, aggregate = TRUE) + 
  labs(title = "Forward Pre-QC")

# reverse reads
reverse_preQC_plot <- 
  plotQualityProfile(reverse_reads, aggregate = TRUE) + 
  labs(title = "Reverse Pre-QC")

preQC_aggregate_plot <- 
  # Plot the forward and reverse together 
  forward_preQC_plot + reverse_preQC_plot

# Show the plot
preQC_aggregate_plot


Here, we see that the plots show the standard Illumina pattern: quality is higher at the beginning of the read and gradually worsens as the read progresses. This is typical of Illumina sequencing because of phasing. We also see slightly lower quality in the reverse reads, due to the less efficient chemistry.

Note: The first few bases at the beginning of the forward reads have VERY LOW quality. Take a look at the MultiQC report to explore the data and confirm what you see here in the dada2 plot.

Prepare a placeholder for filtered reads

# vector of our samples, extract sample name from files 
samples <- sapply(strsplit(basename(forward_reads), "_"), `[`,1) 
# Intuition Check 
head(samples)
## [1] "022um-Control"     "20210602-MA-ABB1F" "20210602-MA-ABB1P"
## [4] "20210602-MA-ABB1W" "20210602-MA-ABS1F" "20210602-MA-ABS1P"
# Place filtered reads into filtered_fastqs_path
filtered_fastqs_path <- "data/01_DADA2/02_filtered_fastqs"
filtered_fastqs_path
## [1] "data/01_DADA2/02_filtered_fastqs"
# Create 2 vectors: filtered_forward_reads & filtered_reverse_reads
filtered_forward_reads <- 
  file.path(filtered_fastqs_path, paste0(samples, "_R1_filtered.fastq.gz"))
length(filtered_forward_reads)
## [1] 95
# reverse reads
filtered_reverse_reads <- 
  file.path(filtered_fastqs_path, paste0(samples, "_R2_filtered.fastq.gz"))
head(filtered_reverse_reads)
## [1] "data/01_DADA2/02_filtered_fastqs/022um-Control_R2_filtered.fastq.gz"    
## [2] "data/01_DADA2/02_filtered_fastqs/20210602-MA-ABB1F_R2_filtered.fastq.gz"
## [3] "data/01_DADA2/02_filtered_fastqs/20210602-MA-ABB1P_R2_filtered.fastq.gz"
## [4] "data/01_DADA2/02_filtered_fastqs/20210602-MA-ABB1W_R2_filtered.fastq.gz"
## [5] "data/01_DADA2/02_filtered_fastqs/20210602-MA-ABS1F_R2_filtered.fastq.gz"
## [6] "data/01_DADA2/02_filtered_fastqs/20210602-MA-ABS1P_R2_filtered.fastq.gz"

Filter and Trim Reads

The parameters of filter and trim DEPEND ON THE DATASET. Things to keep in mind:
- The library preparation: Are the primers included in the sequence? If so, they need to be trimmed out in this step.
- What do the quality profiles of the reads above look like? If they are lower quality, it is highly recommended to use maxEE = c(1,1).
- Do the reads dip suddenly in quality? If so, explore trimLeft and truncLen.

Check out more of the parameters with ?filterAndTrim, which brings up the help page, and do some googling about them. Notes on two examples are below, with a description of a few of the parameters:

  1. In-class dataset: This salinity gradient dataset was generated with the library preparation described by Kozich et al., 2013 AEM, and the reads maintained high Phred scores (above 30, typically above ~34) all the way through to the end of the sequence. Therefore, we will not truncate the reads for this dataset, and we will use the less stringent maxEE = c(2,2).
  2. Lower-quality datasets: If the sequence quality were lower, it’s recommended to use maxEE = c(1,1), as in the commented-out example below. In that second example (code shown but not executed), the dataset had the primers in the sequences, because the library was prepared with the Illumina 2-step PCR protocol, so the primers must be trimmed with trimLeft. In addition, the reverse reads were of much lower quality (which is typical), so they were truncated at the position where the Phred scores drop below about 30. The values of these parameters will be highly specific to your own analysis, but you can also play it safe and go with the default values listed in the official DADA2 pipeline.
  • maxEE is a quality-filtering threshold applied to the expected errors of each read. Here, a read with up to 2 expected errors is kept; anything above 2 is discarded. It takes two values: the first for forward reads, the second for reverse reads.
  • trimLeft can be used to remove bases at the beginning of a read (e.g. to trim out primers!).
  • truncLen can be used to truncate sequences after a specific base position when the quality drops. Note that this shortens the resulting ASVs! For example, it can be used when the quality of a read suddenly drops, or is consistently lower, below a Phred score of 25 (on the y-axis of plotQualityProfile above; Q25 corresponds to ~99.7% per-base accuracy).
  • maxN sets the maximum number of ambiguous (N) bases allowed. Here, since we are calling ASVs, we should ALWAYS remove all Ns from the data (maxN = 0).
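To make maxEE concrete, here is a small base-R sketch (not part of the dada2 API; the function name is made up for illustration) of how expected errors are computed from Phred scores:

```r
# Expected errors (EE) of a read = sum over bases of 10^(-Q/10),
# where Q is the Phred quality score of each base.
expected_errors <- function(quals) sum(10^(-quals / 10))

# A hypothetical 250-bp read at Q30 everywhere: EE = 250 * 0.001 = 0.25,
# which passes maxEE = 2.
expected_errors(rep(30, 250))

# The same read with its last 50 bases dropping to Q10:
# EE = 200 * 0.001 + 50 * 0.1 = 5.2, which fails maxEE = 2.
expected_errors(c(rep(30, 200), rep(10, 50)))
```

This is why maxEE is more informative than an average quality cutoff: a short run of very low-quality bases dominates the sum.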
# Assign a vector to filtered reads 
# Trim out poor-quality bases: the first 3 bp on the forward reads
# Write out filtered fastq files 
# This class dataset used the Kozich et al. (2013) AEM library preparation
      # Link to paper: https://doi.org/10.1128/AEM.01043-13
# Therefore, we do not need to trim the primers, because they were not sequenced
filtered_reads <- 
  filterAndTrim(fwd = forward_reads, filt = filtered_forward_reads,
              rev = reverse_reads, filt.rev = filtered_reverse_reads,
              maxN = 0, maxEE = c(2,2), trimLeft = 3,
              truncQ = 2, rm.phix = TRUE, compress = TRUE, multithread = TRUE)


##### Another example of a filterAndTrim Step 
# This dataset is from sequences generated by Cornell Sequencing Facility using  
# the typical Earth Microbiome protocol mentioned in their 16S SOP:
        # https://earthmicrobiome.org/protocols-and-standards/16s/
# Here, the library prep was the standard Illumina 2-step PCR preparation.
# In this example, the primer sequences ARE in the reads and must be trimmed using trimLeft
#
#filterAndTrim(forward_reads, filtered_forward_reads,
#              reverse_reads, filtered_reverse_reads,
#              truncLen = c(240,220), trimLeft = c(19,20),
#              maxN = 0, maxEE = c(1,1), truncQ = 2, 
#              rm.phix = TRUE, compress = TRUE, 
#              multithread = TRUE)

Assess Trimmed Read Quality

# Plot the 12 random samples after QC
forward_filteredQual_plot_12 <- 
  plotQualityProfile(filtered_forward_reads[random_samples]) + 
  labs(title = "Trimmed Forward Read Quality")

reverse_filteredQual_plot_12 <- 
  plotQualityProfile(filtered_reverse_reads[random_samples]) + 
  labs(title = "Trimmed Reverse Read Quality")

# Put the two plots together 
forward_filteredQual_plot_12 + reverse_filteredQual_plot_12

Aggregated Trimmed Plots

# Aggregate all QC plots 
# Forward reads
forward_postQC_plot <- 
  plotQualityProfile(filtered_forward_reads, aggregate = TRUE) + 
  labs(title = "Forward Post-QC")

# reverse reads
reverse_postQC_plot <- 
  plotQualityProfile(filtered_reverse_reads, aggregate = TRUE) + 
  labs(title = "Reverse Post-QC")

postQC_aggregate_plot <- 
  # Plot the forward and reverse together 
  forward_postQC_plot + reverse_postQC_plot

# Show the plot
postQC_aggregate_plot


Here, we see the sequences are improved compared to before: the very low quality we saw in the pre-QC plots is mostly gone. If we wanted, we could trim more of the forward reads. However, the quality here stays above a Phred score of 30, which I feel ok with.

Stats on read output from filterAndTrim

# Make output into dataframe 
filtered_df <- as.data.frame(filtered_reads)
head(filtered_df)
##                                            reads.in reads.out
## 022um-Control_S36_L001_R1_001.fastq.gz         8722      8023
## 20210602-MA-ABB1F_S31_L001_R1_001.fastq.gz     5156      4632
## 20210602-MA-ABB1P_S28_L001_R1_001.fastq.gz     5051      4597
## 20210602-MA-ABB1W_S58_L001_R1_001.fastq.gz     7695      7023
## 20210602-MA-ABS1F_S45_L001_R1_001.fastq.gz     5098      4595
## 20210602-MA-ABS1P_S65_L001_R1_001.fastq.gz     5900      5314
# calculate some stats 
filtered_df %>%
  reframe(median_reads_in = median(reads.in),
          median_reads_out = median(reads.out),
          median_percent_retained = (median(reads.out)/median(reads.in)))
##   median_reads_in median_reads_out median_percent_retained
## 1            6623             5859               0.8846444

[Insert paragraph interpreting the results above]

  • How many reads got through? Is it “enough”?
  • Should you play with the parameters in filterAndTrim() more? If so, which parameters?

Visualize QC differences in plot

# Plot the pre and post together in one plot
preQC_aggregate_plot / postQC_aggregate_plot

Here we can see how many total reads were removed during QC (comparing the red numbers at the bottom of each plot). We’d like to see the quality improve after QC.
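Beyond the medians calculated above, a quick per-sample check can flag samples that lost unusually many reads. A sketch using the same reads.in/reads.out columns as filtered_df (the demo values mirror the head() output above; the 80% threshold is an arbitrary, hypothetical cutoff):

```r
# Demo data frame mirroring the first rows of filtered_df above
filtered_demo <- data.frame(
  reads.in  = c(8722, 5156, 5051),
  reads.out = c(8023, 4632, 4597))

# Per-sample retention, and which samples (if any) fall below 80%
filtered_demo$percent_retained <- 100 * filtered_demo$reads.out / filtered_demo$reads.in
which(filtered_demo$percent_retained < 80)
```

In this demo all three samples retain ~90% of their reads, so nothing is flagged; on a real run, flagged samples are worth inspecting before deciding whether to re-run filterAndTrim() with different parameters.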

Error Modelling

Note: every sequencing run needs to be processed separately! The error model MUST be learned separately for each Illumina dataset, because every Illumina run is different, even if the flow cell and DNA/samples are the same. If you’d like to combine data from multiple Illumina sequencing runs, you’ll need to apply the exact same filterAndTrim() step AND, very importantly, use the same primers and truncation lengths, so that the ASVs from each run cover the same 16S region.
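If you do need to combine runs, dada2 provides mergeSequenceTables() for exactly this. A hedged sketch (the per-run object names are hypothetical, and this chunk is not run here):

```r
# NOT RUN: combining per-run ASV tables, assuming each run was processed
# through its own learnErrors()/dada()/mergePairs() with identical
# filterAndTrim() settings and primer/truncation choices.
#seqtab_run1 <- makeSequenceTable(merged_ASVs_run1)
#seqtab_run2 <- makeSequenceTable(merged_ASVs_run2)
#combined_seqtab <- mergeSequenceTables(seqtab_run1, seqtab_run2)
```

Chimera removal and taxonomy assignment would then be done once, on the combined table.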

But wait: what contributes to sequencing error in different sequencing runs, and why do we need to model errors separately per run with learnErrors() in dada2? Remember the core principles of how Illumina sequencing works! Some things that contribute are:

- Different timings for when clusters go out of sync (the drop in quality toward the end of reads that is typical of Illumina sequencing).
- The cluster density is impossible to replicate exactly, so the cluster density (and therefore sequence quality) will always differ between sequencing runs (even with the same person/samples/sequencing facility!).
- The PhiX spike-in will also vary between runs, even if we try to keep it the same. Therefore, the amount of heterogeneity on the flow cell, which affects quality, will also differ.
- Different locations on the flow cell can be affected differently between runs: perhaps an air bubble gets in, or the cluster density happens to be higher/lower on a different run/flow cell.

Ok, that said, let’s now infer error rates for all possible transitions within purines and pyrimidines (A<>G or C<>T) and transversions between all purine and pyrimidine combinations. The error model is learned by alternating estimation of the error rates and inference of the sample composition until they converge. Specifically, the algorithm:

  1. Starts by assuming the error rates are at their maximum (it takes the most abundant sequence, the “center”, and assumes it is the only sequence not caused by errors).
  2. Compares the other sequences to the most abundant sequence.
  3. Uses at most 10^8 nucleotides for the error estimation (the nbases default).
  4. Uses a parametric error estimation function (a loess fit) on the observed error rates.

Learn the errors

# Forward reads 
error_forward_reads <- 
  learnErrors(filtered_forward_reads, multithread = TRUE)
## 101513344 total bases in 409328 reads from 71 samples will be used for learning the error rates.
# Plot Forward  
forward_error_plot <- 
  plotErrors(error_forward_reads, nominalQ = TRUE) + 
  labs(title = "Forward Read Error Model")

# Reverse reads 
error_reverse_reads <- 
  learnErrors(filtered_reverse_reads, multithread = TRUE)
## 101513344 total bases in 409328 reads from 71 samples will be used for learning the error rates.
# Plot reverse
reverse_error_plot <- 
  plotErrors(error_reverse_reads, nominalQ = TRUE) + 
  labs(title = "Reverse Read Error Model")

# Put the two plots together
forward_error_plot + reverse_error_plot
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.


  • The error rates for each possible transition (A→C, A→G, …) are shown in the plot above.

Details of the plot:

- Points: the observed error rates for each consensus quality score.
- Black line: the estimated error rates after convergence of the machine-learning algorithm.
- Red line: the error rates expected under the nominal definition of the Q-score.

Similar to what is mentioned in the dada2 tutorial: the estimated error rates (black line) are a “reasonably good” fit to the observed rates (points), and the error rates drop with increased quality as expected. We can now infer ASVs!

Infer ASVs

An important note: this process runs separately on the forward and reverse reads! This is quite a different approach from how OTUs are identified in Mothur, and from UCHIME, oligotyping, and other OTU, MED, and ASV approaches.

# Infer ASVs on the forward sequences
dada_forward <- dada(filtered_forward_reads,
                     err = error_forward_reads, 
                     multithread = TRUE)
## Sample 1 - 8023 reads in 903 unique sequences.
## Sample 2 - 4632 reads in 1691 unique sequences.
## Sample 3 - 4597 reads in 2340 unique sequences.
## Sample 4 - 7023 reads in 3070 unique sequences.
## Sample 5 - 4595 reads in 2108 unique sequences.
## Sample 6 - 5314 reads in 2688 unique sequences.
## Sample 7 - 6145 reads in 2899 unique sequences.
## Sample 8 - 5489 reads in 2473 unique sequences.
## Sample 9 - 6954 reads in 2989 unique sequences.
## Sample 10 - 6599 reads in 3046 unique sequences.
## Sample 11 - 5869 reads in 2320 unique sequences.
## Sample 12 - 5632 reads in 2336 unique sequences.
## Sample 13 - 5157 reads in 2130 unique sequences.
## Sample 14 - 3708 reads in 2027 unique sequences.
## Sample 15 - 5753 reads in 2707 unique sequences.
## Sample 16 - 6743 reads in 3242 unique sequences.
## Sample 17 - 5750 reads in 3452 unique sequences.
## Sample 18 - 4411 reads in 2735 unique sequences.
## Sample 19 - 6351 reads in 3734 unique sequences.
## Sample 20 - 6745 reads in 2383 unique sequences.
## Sample 21 - 3451 reads in 1560 unique sequences.
## Sample 22 - 4551 reads in 1837 unique sequences.
## Sample 23 - 6094 reads in 2284 unique sequences.
## Sample 24 - 3506 reads in 1118 unique sequences.
## Sample 25 - 5268 reads in 2153 unique sequences.
## Sample 26 - 5436 reads in 2596 unique sequences.
## Sample 27 - 4199 reads in 2246 unique sequences.
## Sample 28 - 6156 reads in 2902 unique sequences.
## Sample 29 - 8615 reads in 3723 unique sequences.
## Sample 30 - 7402 reads in 3618 unique sequences.
## Sample 31 - 4346 reads in 2061 unique sequences.
## Sample 32 - 3132 reads in 619 unique sequences.
## Sample 33 - 4600 reads in 2251 unique sequences.
## Sample 34 - 9349 reads in 3907 unique sequences.
## Sample 35 - 4949 reads in 2193 unique sequences.
## Sample 36 - 6222 reads in 2872 unique sequences.
## Sample 37 - 4559 reads in 2025 unique sequences.
## Sample 38 - 6836 reads in 2887 unique sequences.
## Sample 39 - 5054 reads in 2434 unique sequences.
## Sample 40 - 6161 reads in 2679 unique sequences.
## Sample 41 - 6482 reads in 2855 unique sequences.
## Sample 42 - 5584 reads in 2781 unique sequences.
## Sample 43 - 5475 reads in 2507 unique sequences.
## Sample 44 - 4930 reads in 2366 unique sequences.
## Sample 45 - 7807 reads in 3404 unique sequences.
## Sample 46 - 5710 reads in 2656 unique sequences.
## Sample 47 - 5948 reads in 2638 unique sequences.
## Sample 48 - 5179 reads in 2414 unique sequences.
## Sample 49 - 6706 reads in 3029 unique sequences.
## Sample 50 - 6419 reads in 2480 unique sequences.
## Sample 51 - 5299 reads in 2321 unique sequences.
## Sample 52 - 5691 reads in 2346 unique sequences.
## Sample 53 - 5849 reads in 2359 unique sequences.
## Sample 54 - 6507 reads in 2714 unique sequences.
## Sample 55 - 6749 reads in 2759 unique sequences.
## Sample 56 - 6 reads in 6 unique sequences.
## Sample 57 - 6150 reads in 3055 unique sequences.
## Sample 58 - 5701 reads in 2543 unique sequences.
## Sample 59 - 4425 reads in 2019 unique sequences.
## Sample 60 - 9652 reads in 2464 unique sequences.
## Sample 61 - 5307 reads in 2359 unique sequences.
## Sample 62 - 5859 reads in 2711 unique sequences.
## Sample 63 - 8015 reads in 3877 unique sequences.
## Sample 64 - 6967 reads in 3167 unique sequences.
## Sample 65 - 5197 reads in 2434 unique sequences.
## Sample 66 - 5745 reads in 2895 unique sequences.
## Sample 67 - 5799 reads in 2664 unique sequences.
## Sample 68 - 4956 reads in 2525 unique sequences.
## Sample 69 - 6634 reads in 3547 unique sequences.
## Sample 70 - 6664 reads in 3327 unique sequences.
## Sample 71 - 6540 reads in 2970 unique sequences.
## Sample 72 - 6871 reads in 3323 unique sequences.
## Sample 73 - 5141 reads in 2474 unique sequences.
## Sample 74 - 5153 reads in 3258 unique sequences.
## Sample 75 - 4952 reads in 3273 unique sequences.
## Sample 76 - 4476 reads in 2962 unique sequences.
## Sample 77 - 6561 reads in 3777 unique sequences.
## Sample 78 - 5046 reads in 2916 unique sequences.
## Sample 79 - 4591 reads in 2856 unique sequences.
## Sample 80 - 6893 reads in 3023 unique sequences.
## Sample 81 - 9379 reads in 4255 unique sequences.
## Sample 82 - 9434 reads in 3987 unique sequences.
## Sample 83 - 5833 reads in 2520 unique sequences.
## Sample 84 - 5721 reads in 2747 unique sequences.
## Sample 85 - 7277 reads in 3185 unique sequences.
## Sample 86 - 7954 reads in 3133 unique sequences.
## Sample 87 - 6338 reads in 2723 unique sequences.
## Sample 88 - 7778 reads in 4114 unique sequences.
## Sample 89 - 6856 reads in 2734 unique sequences.
## Sample 90 - 6215 reads in 1967 unique sequences.
## Sample 91 - 6455 reads in 2584 unique sequences.
## Sample 92 - 8770 reads in 985 unique sequences.
## Sample 93 - 6095 reads in 788 unique sequences.
## Sample 94 - 7038 reads in 3146 unique sequences.
## Sample 95 - 9520 reads in 966 unique sequences.
typeof(dada_forward)
## [1] "list"
# Grab a sample and look at it 
dada_forward$`20211005-MA-CWS1P_R1_filtered.fastq.gz`
## dada-class: object describing DADA2 denoising results
## 241 sequence variants were inferred from 2916 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
# Infer ASVs on the reverse sequences 
dada_reverse <- dada(filtered_reverse_reads,
                     err = error_reverse_reads,
                     multithread = TRUE)
## Sample 1 - 8023 reads in 1355 unique sequences.
## Sample 2 - 4632 reads in 1911 unique sequences.
## Sample 3 - 4597 reads in 2539 unique sequences.
## Sample 4 - 7023 reads in 3424 unique sequences.
## Sample 5 - 4595 reads in 2419 unique sequences.
## Sample 6 - 5314 reads in 2921 unique sequences.
## Sample 7 - 6145 reads in 3182 unique sequences.
## Sample 8 - 5489 reads in 2689 unique sequences.
## Sample 9 - 6954 reads in 3398 unique sequences.
## Sample 10 - 6599 reads in 3342 unique sequences.
## Sample 11 - 5869 reads in 2849 unique sequences.
## Sample 12 - 5632 reads in 2649 unique sequences.
## Sample 13 - 5157 reads in 2481 unique sequences.
## Sample 14 - 3708 reads in 2211 unique sequences.
## Sample 15 - 5753 reads in 3005 unique sequences.
## Sample 16 - 6743 reads in 3589 unique sequences.
## Sample 17 - 5750 reads in 3827 unique sequences.
## Sample 18 - 4411 reads in 3091 unique sequences.
## Sample 19 - 6351 reads in 3971 unique sequences.
## Sample 20 - 6745 reads in 2630 unique sequences.
## Sample 21 - 3451 reads in 1745 unique sequences.
## Sample 22 - 4551 reads in 2079 unique sequences.
## Sample 23 - 6094 reads in 2555 unique sequences.
## Sample 24 - 3506 reads in 1336 unique sequences.
## Sample 25 - 5268 reads in 2483 unique sequences.
## Sample 26 - 5436 reads in 2830 unique sequences.
## Sample 27 - 4199 reads in 2425 unique sequences.
## Sample 28 - 6156 reads in 3180 unique sequences.
## Sample 29 - 8615 reads in 4131 unique sequences.
## Sample 30 - 7402 reads in 4064 unique sequences.
## Sample 31 - 4346 reads in 2313 unique sequences.
## Sample 32 - 3132 reads in 769 unique sequences.
## Sample 33 - 4600 reads in 2464 unique sequences.
## Sample 34 - 9349 reads in 4317 unique sequences.
## Sample 35 - 4949 reads in 2408 unique sequences.
## Sample 36 - 6222 reads in 3200 unique sequences.
## Sample 37 - 4559 reads in 2209 unique sequences.
## Sample 38 - 6836 reads in 3185 unique sequences.
## Sample 39 - 5054 reads in 2664 unique sequences.
## Sample 40 - 6161 reads in 2999 unique sequences.
## Sample 41 - 6482 reads in 3102 unique sequences.
## Sample 42 - 5584 reads in 3135 unique sequences.
## Sample 43 - 5475 reads in 2734 unique sequences.
## Sample 44 - 4930 reads in 2572 unique sequences.
## Sample 45 - 7807 reads in 4124 unique sequences.
## Sample 46 - 5710 reads in 2946 unique sequences.
## Sample 47 - 5948 reads in 3229 unique sequences.
## Sample 48 - 5179 reads in 2650 unique sequences.
## Sample 49 - 6706 reads in 3726 unique sequences.
## Sample 50 - 6419 reads in 2761 unique sequences.
## Sample 51 - 5299 reads in 2576 unique sequences.
## Sample 52 - 5691 reads in 2613 unique sequences.
## Sample 53 - 5849 reads in 2617 unique sequences.
## Sample 54 - 6507 reads in 3044 unique sequences.
## Sample 55 - 6749 reads in 3476 unique sequences.
## Sample 56 - 6 reads in 6 unique sequences.
## Sample 57 - 6150 reads in 3720 unique sequences.
## Sample 58 - 5701 reads in 2818 unique sequences.
## Sample 59 - 4425 reads in 2223 unique sequences.
## Sample 60 - 9652 reads in 2984 unique sequences.
## Sample 61 - 5307 reads in 2633 unique sequences.
## Sample 62 - 5859 reads in 2974 unique sequences.
## Sample 63 - 8015 reads in 4285 unique sequences.
## Sample 64 - 6967 reads in 3484 unique sequences.
## Sample 65 - 5197 reads in 2666 unique sequences.
## Sample 66 - 5745 reads in 3109 unique sequences.
## Sample 67 - 5799 reads in 2970 unique sequences.
## Sample 68 - 4956 reads in 2790 unique sequences.
## Sample 69 - 6634 reads in 3848 unique sequences.
## Sample 70 - 6664 reads in 3719 unique sequences.
## Sample 71 - 6540 reads in 3569 unique sequences.
## Sample 72 - 6871 reads in 3549 unique sequences.
## Sample 73 - 5141 reads in 2691 unique sequences.
## Sample 74 - 5153 reads in 3405 unique sequences.
## Sample 75 - 4952 reads in 3381 unique sequences.
## Sample 76 - 4476 reads in 3116 unique sequences.
## Sample 77 - 6561 reads in 4228 unique sequences.
## Sample 78 - 5046 reads in 3071 unique sequences.
## Sample 79 - 4591 reads in 3023 unique sequences.
## Sample 80 - 6893 reads in 3397 unique sequences.
## Sample 81 - 9379 reads in 4696 unique sequences.
## Sample 82 - 9434 reads in 4325 unique sequences.
## Sample 83 - 5833 reads in 2936 unique sequences.
## Sample 84 - 5721 reads in 3029 unique sequences.
## Sample 85 - 7277 reads in 3561 unique sequences.
## Sample 86 - 7954 reads in 3504 unique sequences.
## Sample 87 - 6338 reads in 3082 unique sequences.
## Sample 88 - 7778 reads in 4504 unique sequences.
## Sample 89 - 6856 reads in 3046 unique sequences.
## Sample 90 - 6215 reads in 2351 unique sequences.
## Sample 91 - 6455 reads in 3231 unique sequences.
## Sample 92 - 8770 reads in 1547 unique sequences.
## Sample 93 - 6095 reads in 1163 unique sequences.
## Sample 94 - 7038 reads in 3304 unique sequences.
## Sample 95 - 9520 reads in 1551 unique sequences.
# Inspect 
dada_reverse[1]
## $`022um-Control_R2_filtered.fastq.gz`
## dada-class: object describing DADA2 denoising results
## 9 sequence variants were inferred from 1355 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
dada_reverse[30]
## $`20210602-MA-SCS1P_R2_filtered.fastq.gz`
## dada-class: object describing DADA2 denoising results
## 218 sequence variants were inferred from 4064 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Merge Forward & Reverse ASVs

Now, merge the forward and reverse ASVs into contigs.

# merge forward and reverse ASVs
merged_ASVs <- mergePairs(dada_forward, filtered_forward_reads, 
                          dada_reverse, filtered_reverse_reads,
                          verbose = TRUE)
## 7972 paired-reads (in 9 unique pairings) successfully merged out of 7972 (in 9 pairings) input.
## 4166 paired-reads (in 154 unique pairings) successfully merged out of 4313 (in 194 pairings) input.
## 3702 paired-reads (in 155 unique pairings) successfully merged out of 3896 (in 209 pairings) input.
## 6087 paired-reads (in 177 unique pairings) successfully merged out of 6318 (in 230 pairings) input.
## 3922 paired-reads (in 135 unique pairings) successfully merged out of 4104 (in 175 pairings) input.
## 4335 paired-reads (in 162 unique pairings) successfully merged out of 4509 (in 219 pairings) input.
## 5236 paired-reads (in 164 unique pairings) successfully merged out of 5517 (in 228 pairings) input.
## 4816 paired-reads (in 153 unique pairings) successfully merged out of 4983 (in 198 pairings) input.
## 6088 paired-reads (in 155 unique pairings) successfully merged out of 6341 (in 228 pairings) input.
## 5828 paired-reads (in 168 unique pairings) successfully merged out of 5963 (in 207 pairings) input.
## 5181 paired-reads (in 144 unique pairings) successfully merged out of 5383 (in 199 pairings) input.
## 4875 paired-reads (in 156 unique pairings) successfully merged out of 5083 (in 216 pairings) input.
## 4494 paired-reads (in 143 unique pairings) successfully merged out of 4679 (in 187 pairings) input.
## 3003 paired-reads (in 132 unique pairings) successfully merged out of 3145 (in 175 pairings) input.
## 4871 paired-reads (in 170 unique pairings) successfully merged out of 5131 (in 231 pairings) input.
## 5474 paired-reads (in 177 unique pairings) successfully merged out of 5940 (in 273 pairings) input.
## 4137 paired-reads (in 165 unique pairings) successfully merged out of 4789 (in 295 pairings) input.
## 2836 paired-reads (in 149 unique pairings) successfully merged out of 3373 (in 248 pairings) input.
## 4758 paired-reads (in 197 unique pairings) successfully merged out of 5322 (in 312 pairings) input.
## 6208 paired-reads (in 144 unique pairings) successfully merged out of 6394 (in 175 pairings) input.
## 3007 paired-reads (in 108 unique pairings) successfully merged out of 3103 (in 140 pairings) input.
## 4024 paired-reads (in 107 unique pairings) successfully merged out of 4150 (in 147 pairings) input.
## 5556 paired-reads (in 132 unique pairings) successfully merged out of 5705 (in 184 pairings) input.
## 3222 paired-reads (in 123 unique pairings) successfully merged out of 3311 (in 144 pairings) input.
## 4708 paired-reads (in 129 unique pairings) successfully merged out of 4821 (in 169 pairings) input.
## 4737 paired-reads (in 127 unique pairings) successfully merged out of 4936 (in 177 pairings) input.
## 3384 paired-reads (in 147 unique pairings) successfully merged out of 3612 (in 203 pairings) input.
## 5257 paired-reads (in 155 unique pairings) successfully merged out of 5514 (in 220 pairings) input.
## 7813 paired-reads (in 187 unique pairings) successfully merged out of 8043 (in 261 pairings) input.
## 6236 paired-reads (in 193 unique pairings) successfully merged out of 6526 (in 289 pairings) input.
## 3801 paired-reads (in 121 unique pairings) successfully merged out of 3888 (in 157 pairings) input.
## 3086 paired-reads (in 44 unique pairings) successfully merged out of 3095 (in 51 pairings) input.
## 3789 paired-reads (in 172 unique pairings) successfully merged out of 3931 (in 217 pairings) input.
## 8312 paired-reads (in 201 unique pairings) successfully merged out of 8602 (in 284 pairings) input.
## 4434 paired-reads (in 136 unique pairings) successfully merged out of 4535 (in 170 pairings) input.
## 5336 paired-reads (in 207 unique pairings) successfully merged out of 5561 (in 277 pairings) input.
## 4059 paired-reads (in 140 unique pairings) successfully merged out of 4181 (in 175 pairings) input.
## 6007 paired-reads (in 175 unique pairings) successfully merged out of 6227 (in 241 pairings) input.
## 4343 paired-reads (in 161 unique pairings) successfully merged out of 4477 (in 203 pairings) input.
## 5501 paired-reads (in 171 unique pairings) successfully merged out of 5616 (in 209 pairings) input.
## 5742 paired-reads (in 155 unique pairings) successfully merged out of 5950 (in 201 pairings) input.
## 4627 paired-reads (in 195 unique pairings) successfully merged out of 4802 (in 253 pairings) input.
## 4819 paired-reads (in 168 unique pairings) successfully merged out of 4932 (in 204 pairings) input.
## 4297 paired-reads (in 139 unique pairings) successfully merged out of 4410 (in 178 pairings) input.
## 6725 paired-reads (in 190 unique pairings) successfully merged out of 7033 (in 272 pairings) input.
## 4907 paired-reads (in 144 unique pairings) successfully merged out of 5092 (in 217 pairings) input.
## 5196 paired-reads (in 157 unique pairings) successfully merged out of 5398 (in 205 pairings) input.
## 4464 paired-reads (in 167 unique pairings) successfully merged out of 4620 (in 213 pairings) input.
## 5573 paired-reads (in 160 unique pairings) successfully merged out of 5940 (in 233 pairings) input.
## 5818 paired-reads (in 149 unique pairings) successfully merged out of 5962 (in 183 pairings) input.
## 4494 paired-reads (in 140 unique pairings) successfully merged out of 4711 (in 191 pairings) input.
## 5066 paired-reads (in 140 unique pairings) successfully merged out of 5204 (in 182 pairings) input.
## 5117 paired-reads (in 140 unique pairings) successfully merged out of 5439 (in 203 pairings) input.
## 5619 paired-reads (in 169 unique pairings) successfully merged out of 5858 (in 244 pairings) input.
## 5954 paired-reads (in 143 unique pairings) successfully merged out of 6242 (in 231 pairings) input.
## No paired-reads (in ZERO unique pairings) successfully merged out of 6 pairings) input.
## 4770 paired-reads (in 190 unique pairings) successfully merged out of 5308 (in 293 pairings) input.
## 5061 paired-reads (in 133 unique pairings) successfully merged out of 5229 (in 178 pairings) input.
## 3789 paired-reads (in 103 unique pairings) successfully merged out of 3990 (in 150 pairings) input.
## 9471 paired-reads (in 174 unique pairings) successfully merged out of 9570 (in 208 pairings) input.
## 4614 paired-reads (in 111 unique pairings) successfully merged out of 4854 (in 166 pairings) input.
## 5095 paired-reads (in 172 unique pairings) successfully merged out of 5246 (in 234 pairings) input.
## 6665 paired-reads (in 245 unique pairings) successfully merged out of 7006 (in 360 pairings) input.
## 6060 paired-reads (in 202 unique pairings) successfully merged out of 6260 (in 275 pairings) input.
## 4331 paired-reads (in 140 unique pairings) successfully merged out of 4551 (in 195 pairings) input.
## 4596 paired-reads (in 181 unique pairings) successfully merged out of 4904 (in 263 pairings) input.
## 5073 paired-reads (in 166 unique pairings) successfully merged out of 5232 (in 224 pairings) input.
## 4026 paired-reads (in 149 unique pairings) successfully merged out of 4238 (in 194 pairings) input.
## 5265 paired-reads (in 216 unique pairings) successfully merged out of 5580 (in 306 pairings) input.
## 5494 paired-reads (in 191 unique pairings) successfully merged out of 5776 (in 274 pairings) input.
## 5569 paired-reads (in 178 unique pairings) successfully merged out of 5879 (in 257 pairings) input.
## 5792 paired-reads (in 209 unique pairings) successfully merged out of 6074 (in 304 pairings) input.
## 4362 paired-reads (in 164 unique pairings) successfully merged out of 4522 (in 220 pairings) input.
## 3712 paired-reads (in 164 unique pairings) successfully merged out of 4159 (in 268 pairings) input.
## 3624 paired-reads (in 194 unique pairings) successfully merged out of 3851 (in 270 pairings) input.
## 3013 paired-reads (in 149 unique pairings) successfully merged out of 3441 (in 223 pairings) input.
## 4904 paired-reads (in 172 unique pairings) successfully merged out of 5541 (in 314 pairings) input.
## 3887 paired-reads (in 208 unique pairings) successfully merged out of 4140 (in 285 pairings) input.
## 3299 paired-reads (in 169 unique pairings) successfully merged out of 3684 (in 271 pairings) input.
## 5911 paired-reads (in 172 unique pairings) successfully merged out of 6249 (in 247 pairings) input.
## 8064 paired-reads (in 236 unique pairings) successfully merged out of 8443 (in 354 pairings) input.
## 8328 paired-reads (in 210 unique pairings) successfully merged out of 8625 (in 301 pairings) input.
## 4962 paired-reads (in 159 unique pairings) successfully merged out of 5288 (in 225 pairings) input.
## 4665 paired-reads (in 177 unique pairings) successfully merged out of 4973 (in 254 pairings) input.
## 6274 paired-reads (in 178 unique pairings) successfully merged out of 6551 (in 249 pairings) input.
## 7003 paired-reads (in 143 unique pairings) successfully merged out of 7377 (in 221 pairings) input.
## 5374 paired-reads (in 163 unique pairings) successfully merged out of 5702 (in 248 pairings) input.
## 6236 paired-reads (in 236 unique pairings) successfully merged out of 6546 (in 331 pairings) input.
## 6222 paired-reads (in 148 unique pairings) successfully merged out of 6436 (in 210 pairings) input.
## 5788 paired-reads (in 192 unique pairings) successfully merged out of 5969 (in 231 pairings) input.
## 5553 paired-reads (in 131 unique pairings) successfully merged out of 5917 (in 205 pairings) input.
## 8748 paired-reads (in 12 unique pairings) successfully merged out of 8754 (in 14 pairings) input.
## 6084 paired-reads (in 13 unique pairings) successfully merged out of 6084 (in 13 pairings) input.
## 6062 paired-reads (in 50 unique pairings) successfully merged out of 6820 (in 218 pairings) input.
## 9507 paired-reads (in 3 unique pairings) successfully merged out of 9507 (in 3 pairings) input.
# Evaluate the output 
typeof(merged_ASVs)
## [1] "list"
length(merged_ASVs)
## [1] 95
names(merged_ASVs)
##  [1] "022um-Control_R1_filtered.fastq.gz"    
##  [2] "20210602-MA-ABB1F_R1_filtered.fastq.gz"
##  [3] "20210602-MA-ABB1P_R1_filtered.fastq.gz"
##  [4] "20210602-MA-ABB1W_R1_filtered.fastq.gz"
##  [5] "20210602-MA-ABS1F_R1_filtered.fastq.gz"
##  [6] "20210602-MA-ABS1P_R1_filtered.fastq.gz"
##  [7] "20210602-MA-ABS1W_R1_filtered.fastq.gz"
##  [8] "20210602-MA-CEB1F_R1_filtered.fastq.gz"
##  [9] "20210602-MA-CEB1P_R1_filtered.fastq.gz"
## [10] "20210602-MA-CEB1W_R1_filtered.fastq.gz"
## [11] "20210602-MA-CES1F_R1_filtered.fastq.gz"
## [12] "20210602-MA-CES1P_R1_filtered.fastq.gz"
## [13] "20210602-MA-CES1W_R1_filtered.fastq.gz"
## [14] "20210602-MA-CWB1F_R1_filtered.fastq.gz"
## [15] "20210602-MA-CWB1P_R1_filtered.fastq.gz"
## [16] "20210602-MA-CWB1W_R1_filtered.fastq.gz"
## [17] "20210602-MA-CWS1F_R1_filtered.fastq.gz"
## [18] "20210602-MA-CWS1P_R1_filtered.fastq.gz"
## [19] "20210602-MA-CWS1W_R1_filtered.fastq.gz"
## [20] "20210602-MA-MBB1F_R1_filtered.fastq.gz"
## [21] "20210602-MA-MBB1P_R1_filtered.fastq.gz"
## [22] "20210602-MA-MBB1W_R1_filtered.fastq.gz"
## [23] "20210602-MA-MBS1F_R1_filtered.fastq.gz"
## [24] "20210602-MA-MBS1P_R1_filtered.fastq.gz"
## [25] "20210602-MA-MBS1W_R1_filtered.fastq.gz"
## [26] "20210602-MA-SCB1F_R1_filtered.fastq.gz"
## [27] "20210602-MA-SCB1P_R1_filtered.fastq.gz"
## [28] "20210602-MA-SCB1W_R1_filtered.fastq.gz"
## [29] "20210602-MA-SCS1F_R1_filtered.fastq.gz"
## [30] "20210602-MA-SCS1P_R1_filtered.fastq.gz"
## [31] "20210602-MA-SCS1W_R1_filtered.fastq.gz"
## [32] "20210615-MA-ABB2F_R1_filtered.fastq.gz"
## [33] "20210615-MA-ABB2P_R1_filtered.fastq.gz"
## [34] "20210615-MA-ABB2W_R1_filtered.fastq.gz"
## [35] "20210615-MA-ABS1F_R1_filtered.fastq.gz"
## [36] "20210615-MA-ABS1P_R1_filtered.fastq.gz"
## [37] "20210615-MA-ABS1W_R1_filtered.fastq.gz"
## [38] "20210615-MA-CEB1F_R1_filtered.fastq.gz"
## [39] "20210615-MA-CEB1P_R1_filtered.fastq.gz"
## [40] "20210615-MA-CEB1W_R1_filtered.fastq.gz"
## [41] "20210615-MA-CES1F_R1_filtered.fastq.gz"
## [42] "20210615-MA-CES1P_R1_filtered.fastq.gz"
## [43] "20210615-MA-CES1W_R1_filtered.fastq.gz"
## [44] "20210615-MA-CWB1F_R1_filtered.fastq.gz"
## [45] "20210615-MA-CWB1P_R1_filtered.fastq.gz"
## [46] "20210615-MA-CWB1W_R1_filtered.fastq.gz"
## [47] "20210615-MA-CWS1F_R1_filtered.fastq.gz"
## [48] "20210615-MA-CWS1P_R1_filtered.fastq.gz"
## [49] "20210615-MA-CWS1W_R1_filtered.fastq.gz"
## [50] "20210615-MA-MBB1F_R1_filtered.fastq.gz"
## [51] "20210615-MA-MBB1P_R1_filtered.fastq.gz"
## [52] "20210615-MA-MBB1W_R1_filtered.fastq.gz"
## [53] "20210615-MA-MBS1F_R1_filtered.fastq.gz"
## [54] "20210615-MA-MBS1P_R1_filtered.fastq.gz"
## [55] "20210615-MA-MBS1W_R1_filtered.fastq.gz"
## [56] "20210615-MA-SCB2F_R1_filtered.fastq.gz"
## [57] "20210615-MA-SCB2P_R1_filtered.fastq.gz"
## [58] "20210615-MA-SCB2W_R1_filtered.fastq.gz"
## [59] "20210615-MA-SCS2F_R1_filtered.fastq.gz"
## [60] "20210615-MA-SCS2P_R1_filtered.fastq.gz"
## [61] "20210615-MA-SCS2W_R1_filtered.fastq.gz"
## [62] "20211005-MA-ABB2F_R1_filtered.fastq.gz"
## [63] "20211005-MA-ABB2P_R1_filtered.fastq.gz"
## [64] "20211005-MA-ABB2W_R1_filtered.fastq.gz"
## [65] "20211005-MA-ABS1F_R1_filtered.fastq.gz"
## [66] "20211005-MA-ABS1P_R1_filtered.fastq.gz"
## [67] "20211005-MA-ABS1W_R1_filtered.fastq.gz"
## [68] "20211005-MA-CEB1F_R1_filtered.fastq.gz"
## [69] "20211005-MA-CEB1P_R1_filtered.fastq.gz"
## [70] "20211005-MA-CEB1W_R1_filtered.fastq.gz"
## [71] "20211005-MA-CES2F_R1_filtered.fastq.gz"
## [72] "20211005-MA-CES2P_R1_filtered.fastq.gz"
## [73] "20211005-MA-CES2W_R1_filtered.fastq.gz"
## [74] "20211005-MA-CWB1F_R1_filtered.fastq.gz"
## [75] "20211005-MA-CWB1P_R1_filtered.fastq.gz"
## [76] "20211005-MA-CWB1W_R1_filtered.fastq.gz"
## [77] "20211005-MA-CWS1F_R1_filtered.fastq.gz"
## [78] "20211005-MA-CWS1P_R1_filtered.fastq.gz"
## [79] "20211005-MA-CWS1W_R1_filtered.fastq.gz"
## [80] "20211005-MA-MBB1F_R1_filtered.fastq.gz"
## [81] "20211005-MA-MBB1P_R1_filtered.fastq.gz"
## [82] "20211005-MA-MBB1W_R1_filtered.fastq.gz"
## [83] "20211005-MA-MBS1F_R1_filtered.fastq.gz"
## [84] "20211005-MA-MBS1P_R1_filtered.fastq.gz"
## [85] "20211005-MA-MBS1W_R1_filtered.fastq.gz"
## [86] "20211005-MA-SCB1F_R1_filtered.fastq.gz"
## [87] "20211005-MA-SCB1P_R1_filtered.fastq.gz"
## [88] "20211005-MA-SCB1W_R1_filtered.fastq.gz"
## [89] "20211005-MA-SCS2F_R1_filtered.fastq.gz"
## [90] "20211005-MA-SCS2P_R1_filtered.fastq.gz"
## [91] "20211005-MA-SCS2W_R1_filtered.fastq.gz"
## [92] "3um-Control_R1_filtered.fastq.gz"      
## [93] "DNA-Ext-Control_R1_filtered.fastq.gz"  
## [94] "MockZymoPos_R1_filtered.fastq.gz"      
## [95] "WaterControl_R1_filtered.fastq.gz"
# Inspect the merger data.frame from the 20210602-MA-ABB1P sample 
head(merged_ASVs[[3]])
##                                                                                                                                                                                                                                                    sequence
## 1 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTGTCAAGTCTGCTGTTAAAGCGTGGAGCTTAACTCCATTTCGGCAGTGGAAACTGACAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## 2  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGACCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## 3 CGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCTGTAGGTGGTTTAATAAGTCAACTGTTAAATCTTGAGGCTCAACTTCAAAATCGCAGTCGAAACTATTAGACTAGAGTATAGTAGAGGTAAAGGGAATTTCCAGTGGAGCGGTGAAATGCGTAGATATTGGAAAGAACACCGATGGCGAAAGCACTTTACTGGGCTATTACTAACACTCAGAGACGAAAGCTAGGGTAGCAAATG
## 4 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTTGTAAGTCTGTTGTTAAAGCGTGGAGCTTAACTCCATTTCAGCAATGGAAACTGTAAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## 5 CGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTATTAAGTTGGGTGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGATTCACTAGAGTACGAAAGAGGGAGGTAGAATTCACAGTGTAGCGGTGGAATGCGTAGATATTGTGAAGAATACCAATGGCGAAGGCAGCCTCCTGGTTCTGTACTGACACTGAGGTGCGAAAGCGTGGGTAGCGAACA
## 6  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGGCCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1       416       1       1    247         0      0      1   TRUE
## 2       315       2       2    248         0      0      1   TRUE
## 3       198       4       4    247         0      0      1   TRUE
## 4       176       3       3    247         0      0      1   TRUE
## 5       164       5       5    247         0      0      1   TRUE
## 6       103      79      76    248         0      0      1   TRUE

Create Raw ASV Count Table

# Create the ASV Count Table 
raw_ASV_table <- makeSequenceTable(merged_ASVs)

# Write out the file to data/01_DADA2 
# (the final tables are written out at the end of this script)


# Check the type and dimensions of the data
dim(raw_ASV_table)
## [1]   95 3442
class(raw_ASV_table)
## [1] "matrix" "array"
typeof(raw_ASV_table)
## [1] "integer"
# Inspect the distribution of sequence lengths of all ASVs in dataset 
table(nchar(getSequences(raw_ASV_table)))
## 
##  248  249  250  251  252  253  254  255  258  262  264  266  268  279  290  299 
##  179 3097   87   11    5    1    1    2    1    1    1    1    1    1    1    1 
##  337  341  342  344  345  353  372  393  409  410  411  412  413  414  415  416 
##    1    2    1    1    1    2    1    1    2    1    2    2    4    4    9    4 
##  417  418  419  422  425  432 
##    6    3    1    1    1    1
# Inspect the distribution of sequence lengths of all ASVs in dataset 
# AFTER TRIM
data.frame(Seq_Length = nchar(getSequences(raw_ASV_table))) %>%
  ggplot(aes(x = Seq_Length )) + 
  geom_histogram() + 
  labs(title = "Raw distribution of ASV length")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

###################################################
###################################################
# TRIM THE ASVS
# Let's trim the ASVs to the expected size, which is 249 bp.
# 249 comes from our expected amplicon of 252 bp minus the 3 bp trimmed
# from the forward read due to low quality.

# We will allow for a few bp of wiggle room around 249 (248-250)
raw_ASV_table_trimmed <- raw_ASV_table[,nchar(colnames(raw_ASV_table)) %in% 248:250]

# Inspect the distribution of sequence lengths of all ASVs in dataset 
table(nchar(getSequences(raw_ASV_table_trimmed)))
## 
##  248  249  250 
##  179 3097   87
# What proportion is left of the sequences? 
sum(raw_ASV_table_trimmed)/sum(raw_ASV_table)
## [1] 0.9973974
# Inspect the distribution of sequence lengths of all ASVs in dataset 
# AFTER TRIM
data.frame(Seq_Length = nchar(getSequences(raw_ASV_table_trimmed))) %>%
  ggplot(aes(x = Seq_Length )) + 
  geom_histogram() + 
  labs(title = "Trimmed distribution of ASV length")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Note the peak at 249 is ABOVE 3000

# Let's zoom in on the plot 
data.frame(Seq_Length = nchar(getSequences(raw_ASV_table_trimmed))) %>%
  ggplot(aes(x = Seq_Length )) + 
  geom_histogram() + 
  labs(title = "Trimmed distribution of ASV length") + 
  scale_y_continuous(limits = c(0, 500))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

Taking the zoomed-in plot into account: do we want to remove those extra ASVs?
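One quick way to weigh that decision is to ask how many reads the off-length ASVs actually carry. The sketch below runs on a tiny stand-in matrix so it is self-contained; in the workflow you would apply the same two lines directly to `raw_ASV_table`.

```r
# Stand-in for raw_ASV_table: samples x ASVs, colnames are the ASV sequences
demo_tab <- matrix(c(100L, 50L, 3L), nrow = 1)
colnames(demo_tab) <- c(strrep("A", 249), strrep("A", 250), strrep("A", 300))

# Keep only the columns whose sequence length falls OUTSIDE 248-250,
# then ask what fraction of all reads they represent
offtarget <- demo_tab[, !nchar(colnames(demo_tab)) %in% 248:250, drop = FALSE]
sum(offtarget)                   # reads carried by off-length ASVs
sum(offtarget) / sum(demo_tab)   # their fraction of the dataset
```

If that fraction is tiny (here it is well under 1%, matching the ~0.26% we lose above), removing the off-length ASVs costs almost no data.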

Remove Chimeras

Sometimes chimeras arise in our workflow.

Chimeric sequences are artificial sequences formed by the combination of two or more distinct biological sequences. These chimeric sequences can arise during the polymerase chain reaction (PCR) amplification step of the 16S rRNA gene, where fragments from different templates can be erroneously joined together.

Chimera removal is an essential step in the analysis of 16S sequencing data to improve the accuracy of downstream analyses, such as taxonomic assignment and diversity assessment. It helps to avoid the inclusion of misleading or spurious sequences that could lead to incorrect biological interpretations.

# Remove the chimeras in the raw ASV table
noChimeras_ASV_table <- removeBimeraDenovo(raw_ASV_table_trimmed, 
                                           method="consensus", 
                                           multithread=TRUE, verbose=TRUE)
## Identified 283 bimeras out of 3363 input sequences.
# Check the dimensions
dim(noChimeras_ASV_table)
## [1]   95 3080
# What proportion is left of the sequences? 
sum(noChimeras_ASV_table)/sum(raw_ASV_table_trimmed)
## [1] 0.9762963
sum(noChimeras_ASV_table)/sum(raw_ASV_table)
## [1] 0.9737554
# Plot it 
data.frame(Seq_Length_NoChim = nchar(getSequences(noChimeras_ASV_table))) %>%
  ggplot(aes(x = Seq_Length_NoChim )) + 
  geom_histogram()+ 
  labs(title = "Trimmed + Chimera Removal distribution of ASV length")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Note the difference in the peak at 249, which is now BELOW 3000

Track the read counts

Here, we will look at the number of reads that were lost in the filtering, denoising, merging, and chimera removal.

# A little function to identify number seqs 
getN <- function(x) sum(getUniques(x))

# Make the table to track the seqs 
track <- cbind(filtered_reads, 
               sapply(dada_forward, getN),
               sapply(dada_reverse, getN),
               sapply(merged_ASVs, getN),
               rowSums(noChimeras_ASV_table))

head(track)
##                                            reads.in reads.out               
## 022um-Control_S36_L001_R1_001.fastq.gz         8722      8023 8009 7975 7972
## 20210602-MA-ABB1F_S31_L001_R1_001.fastq.gz     5156      4632 4411 4366 4166
## 20210602-MA-ABB1P_S28_L001_R1_001.fastq.gz     5051      4597 4046 4010 3702
## 20210602-MA-ABB1W_S58_L001_R1_001.fastq.gz     7695      7023 6460 6424 6087
## 20210602-MA-ABS1F_S45_L001_R1_001.fastq.gz     5098      4595 4208 4180 3922
## 20210602-MA-ABS1P_S65_L001_R1_001.fastq.gz     5900      5314 4646 4663 4335
##                                                
## 022um-Control_S36_L001_R1_001.fastq.gz     7972
## 20210602-MA-ABB1F_S31_L001_R1_001.fastq.gz 4143
## 20210602-MA-ABB1P_S28_L001_R1_001.fastq.gz 3657
## 20210602-MA-ABB1W_S58_L001_R1_001.fastq.gz 5999
## 20210602-MA-ABS1F_S45_L001_R1_001.fastq.gz 3874
## 20210602-MA-ABS1P_S65_L001_R1_001.fastq.gz 4324
# Update column names to be more informative (most are missing at the moment!)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nochim")
rownames(track) <- samples

# Generate a dataframe to track the reads through our DADA2 pipeline
track_counts_df <- 
  track %>%
  # make it a dataframe
  as.data.frame() %>%
  rownames_to_column(var = "names") %>%
  mutate(perc_reads_retained = 100 * nochim / input)

# Visualize it in table format 
DT::datatable(track_counts_df)
# Plot it!
track_counts_df %>%
  pivot_longer(input:nochim, names_to = "read_type", values_to = "num_reads") %>%
  mutate(read_type = fct_relevel(read_type, 
                                 "input", "filtered", "denoisedF", "denoisedR", "merged", "nochim")) %>%
  ggplot(aes(x = read_type, y = num_reads, fill = read_type)) + 
  geom_line(aes(group = names), color = "grey") + 
  geom_point(shape = 21, size = 3, alpha = 0.8) + 
  scale_fill_brewer(palette = "Spectral") + 
  labs(x = "Filtering Step", y = "Number of Sequences") + 
  theme_bw()

Assign Taxonomy

Here, we will use the Silva database, version 138.1!

# Classify the ASVs against a reference set using the RDP Naive Bayesian Classifier described by Wang et al., (2007) in AEM
taxa_train <- 
  assignTaxonomy(noChimeras_ASV_table, 
                 "/workdir/in_class_data/taxonomy/silva_nr99_v138.1_train_set.fa.gz", 
                 multithread=TRUE)

# Add the genus/species information 
taxa_addSpecies <- 
  addSpecies(taxa_train, 
             "/workdir/in_class_data/taxonomy/silva_species_assignment_v138.1.fa.gz")

# Inspect the taxonomy 
taxa_print <- taxa_addSpecies # Removing sequence rownames for display only
rownames(taxa_print) <- NULL
#View(taxa_print)

Prepare the data for export!

1. ASV Table

Below, we will prepare the following:

  1. Two ASV count tables:
    1. With ASV seqs: ASV headers include the entire ASV sequence (~250 bp).
    2. With ASV names: This includes re-written, shortened headers like ASV_1, ASV_2, etc., which will match the names in our fasta file below.
  2. ASV_fastas: A fasta file that we can use to build a tree for phylogenetic analyses (e.g. phylogenetic alpha diversity metrics or UNIFRAC dissimilarity).

Finalize ASV Count Tables

########### 1. COUNT TABLE ###############
############## Modify the ASV names and then save a fasta file!  ############## 
# Give headers more manageable names
# First pull the ASV sequences
asv_seqs <- colnames(noChimeras_ASV_table)
asv_seqs[1:5]
## [1] "CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTGTCAAGTCTGCTGTTAAAGCGTGGAGCTTAACTCCATTTCGGCAGTGGAAACTGACAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG"
## [2] "CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTTGTAAGTCTGTTGTTAAAGCGTGGAGCTTAACTCCATTTCAGCAATGGAAACTGTAAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG"
## [3] "CAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA"
## [4] "GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGACCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC" 
## [5] "GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGGCCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC"
# make headers for our ASV seq fasta file, which will be our asv names
asv_headers <- vector(dim(noChimeras_ASV_table)[2], mode = "character")
asv_headers[1:5]
## [1] "" "" "" "" ""
# loop through vector and fill it in with ASV names 
for (i in 1:dim(noChimeras_ASV_table)[2]) {
  asv_headers[i] <- paste(">ASV", i, sep = "_")
}

# intuition check
asv_headers[1:5]
## [1] ">ASV_1" ">ASV_2" ">ASV_3" ">ASV_4" ">ASV_5"
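As an aside, the for-loop above can be collapsed into a single vectorized call: `paste0()` recycles over `seq_len(n)`, so the headers can be built in one line. A minimal sketch (here `n_asvs` is set to 5 for display; in the workflow it would be `ncol(noChimeras_ASV_table)`):

```r
# Vectorized alternative to the for-loop: paste0() recycles over
# seq_len(n_asvs), building every ">ASV_i" header in one call
n_asvs <- 5   # in the workflow: ncol(noChimeras_ASV_table)
asv_headers2 <- paste0(">ASV_", seq_len(n_asvs))
asv_headers2
## [1] ">ASV_1" ">ASV_2" ">ASV_3" ">ASV_4" ">ASV_5"
```

Both versions produce identical headers; the vectorized form is simply more idiomatic R.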
##### Rename ASVs in table then write out our ASV fasta file! 
#View(noChimeras_ASV_table)
asv_tab <- t(noChimeras_ASV_table)
#View(asv_tab)

## Rename our asvs! 
row.names(asv_tab) <- sub(">", "", asv_headers)
#View(asv_tab)

2. Taxonomy Table

# Inspect the taxonomy table
#View(taxa_addSpecies)

##### Prepare tax table 
# Add the ASV sequences from the rownames to a column 
new_tax_tab <- 
  taxa_addSpecies%>%
  as.data.frame() %>%
  rownames_to_column(var = "ASVseqs") 
head(new_tax_tab)
##                                                                                                                                                                                                                                                     ASVseqs
## 1 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTGTCAAGTCTGCTGTTAAAGCGTGGAGCTTAACTCCATTTCGGCAGTGGAAACTGACAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## 2 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTTGTAAGTCTGTTGTTAAAGCGTGGAGCTTAACTCCATTTCAGCAATGGAAACTGTAAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## 3 CAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA
## 4  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGACCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## 5  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGGCCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## 6 CGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTATTAAGTTGGGTGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGATTCACTAGAGTACGAAAGAGGGAGGTAGAATTCACAGTGTAGCGGTGGAATGCGTAGATATTGTGAAGAATACCAATGGCGAAGGCAGCCTCCTGGTTCTGTACTGACACTGAGGTGCGAAAGCGTGGGTAGCGAACA
##    Kingdom           Phylum               Class           Order
## 1 Bacteria    Cyanobacteria      Cyanobacteriia Synechococcales
## 2 Bacteria    Cyanobacteria      Cyanobacteriia Synechococcales
## 3 Bacteria   Proteobacteria Gammaproteobacteria Pseudomonadales
## 4 Bacteria Actinobacteriota      Acidimicrobiia Actinomarinales
## 5 Bacteria Actinobacteriota      Acidimicrobiia Actinomarinales
## 6 Bacteria   Proteobacteria Gammaproteobacteria Pseudomonadales
##             Family                   Genus Species
## 1     Cyanobiaceae      Cyanobium PCC-6307    <NA>
## 2     Cyanobiaceae      Cyanobium PCC-6307    <NA>
## 3 Pseudomonadaceae             Pseudomonas    <NA>
## 4 Actinomarinaceae Candidatus Actinomarina    <NA>
## 5 Actinomarinaceae Candidatus Actinomarina    <NA>
## 6      SAR86 clade                    <NA>    <NA>
# intuition check 
stopifnot(new_tax_tab$ASVseqs == colnames(noChimeras_ASV_table))

# Now let's add the ASV names 
rownames(new_tax_tab) <- rownames(asv_tab)
head(new_tax_tab)
##                                                                                                                                                                                                                                                         ASVseqs
## ASV_1 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTGTCAAGTCTGCTGTTAAAGCGTGGAGCTTAACTCCATTTCGGCAGTGGAAACTGACAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## ASV_2 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTTGTAAGTCTGTTGTTAAAGCGTGGAGCTTAACTCCATTTCAGCAATGGAAACTGTAAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## ASV_3 CAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA
## ASV_4  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGACCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## ASV_5  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGGCCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## ASV_6 CGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTATTAAGTTGGGTGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGATTCACTAGAGTACGAAAGAGGGAGGTAGAATTCACAGTGTAGCGGTGGAATGCGTAGATATTGTGAAGAATACCAATGGCGAAGGCAGCCTCCTGGTTCTGTACTGACACTGAGGTGCGAAAGCGTGGGTAGCGAACA
##        Kingdom           Phylum               Class           Order
## ASV_1 Bacteria    Cyanobacteria      Cyanobacteriia Synechococcales
## ASV_2 Bacteria    Cyanobacteria      Cyanobacteriia Synechococcales
## ASV_3 Bacteria   Proteobacteria Gammaproteobacteria Pseudomonadales
## ASV_4 Bacteria Actinobacteriota      Acidimicrobiia Actinomarinales
## ASV_5 Bacteria Actinobacteriota      Acidimicrobiia Actinomarinales
## ASV_6 Bacteria   Proteobacteria Gammaproteobacteria Pseudomonadales
##                 Family                   Genus Species
## ASV_1     Cyanobiaceae      Cyanobium PCC-6307    <NA>
## ASV_2     Cyanobiaceae      Cyanobium PCC-6307    <NA>
## ASV_3 Pseudomonadaceae             Pseudomonas    <NA>
## ASV_4 Actinomarinaceae Candidatus Actinomarina    <NA>
## ASV_5 Actinomarinaceae Candidatus Actinomarina    <NA>
## ASV_6      SAR86 clade                    <NA>    <NA>
### Final prep of tax table. Add new column with ASV names 
asv_tax <- 
  new_tax_tab %>%
  # add rownames from count table for phyloseq handoff
  mutate(ASV = rownames(asv_tab)) %>%
  # Resort the columns with select
  dplyr::select(Kingdom, Phylum, Class, Order, Family, Genus, Species, ASV, ASVseqs)

head(asv_tax)
##        Kingdom           Phylum               Class           Order
## ASV_1 Bacteria    Cyanobacteria      Cyanobacteriia Synechococcales
## ASV_2 Bacteria    Cyanobacteria      Cyanobacteriia Synechococcales
## ASV_3 Bacteria   Proteobacteria Gammaproteobacteria Pseudomonadales
## ASV_4 Bacteria Actinobacteriota      Acidimicrobiia Actinomarinales
## ASV_5 Bacteria Actinobacteriota      Acidimicrobiia Actinomarinales
## ASV_6 Bacteria   Proteobacteria Gammaproteobacteria Pseudomonadales
##                 Family                   Genus Species   ASV
## ASV_1     Cyanobiaceae      Cyanobium PCC-6307    <NA> ASV_1
## ASV_2     Cyanobiaceae      Cyanobium PCC-6307    <NA> ASV_2
## ASV_3 Pseudomonadaceae             Pseudomonas    <NA> ASV_3
## ASV_4 Actinomarinaceae Candidatus Actinomarina    <NA> ASV_4
## ASV_5 Actinomarinaceae Candidatus Actinomarina    <NA> ASV_5
## ASV_6      SAR86 clade                    <NA>    <NA> ASV_6
##                                                                                                                                                                                                                                                         ASVseqs
## ASV_1 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTGTCAAGTCTGCTGTTAAAGCGTGGAGCTTAACTCCATTTCGGCAGTGGAAACTGACAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## ASV_2 CGGGAGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGCAGGCGGTCTTGTAAGTCTGTTGTTAAAGCGTGGAGCTTAACTCCATTTCAGCAATGGAAACTGTAAGACTAGAGTGTGGTAGGGGCAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAAGAACACCAGTGGCGAAGGCGCTCTGCTGGGCCATAACTGACGCTCATGGACGAAAGCCAGGGGAGCGAAAG
## ASV_3 CAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA
## ASV_4  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGACCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## ASV_5  GTAGGGGGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAACAAGTCGGTCGTGAAAGTTCAGGGCTCAACCCTGAAATGTCGATCGATACTGTTGTGACTAGGATACGGTAGAGGTGAGTGGAATTCCGAGTGTAGCGGTGAAATGCGTAGATATTCGGAGGAACACCAATTGCGAAGGCAGCTCACTGGGCCGCTATCGACGCTGAGGAGCGAAAGCTAGGGGAGCAAAC
## ASV_6 CGGAAGGTCCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTATTAAGTTGGGTGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGATTCACTAGAGTACGAAAGAGGGAGGTAGAATTCACAGTGTAGCGGTGGAATGCGTAGATATTGTGAAGAATACCAATGGCGAAGGCAGCCTCCTGGTTCTGTACTGACACTGAGGTGCGAAAGCGTGGGTAGCGAACA
# Intuition check
stopifnot(asv_tax$ASV == rownames(asv_tax), rownames(asv_tax) == rownames(asv_tab))

Write 01_DADA2 files

Now, we will write the files! We will write the following outputs to the data/01_DADA2/ folder, saved both as flat files that could be submitted as supplements AND as .RData objects for easy loading into R in the next steps of the workflow:

  1. ASV_counts.tsv: ASV count table whose headers are rewritten to shortened ASV names like ASV_1, ASV_2, etc., which will match the names in our fasta file below. This will also be saved as data/01_DADA2/ASV_counts.RData.
  2. ASV_counts_withSeqNames.tsv: This is generated from the data object in this file known as noChimeras_ASV_table. Here, the ASV headers are the entire ASV sequence (~250 bp). We will also save this as a .RData object at data/01_DADA2/noChimeras_ASV_table.RData, since we will use it in analysis/02_Taxonomic_Assignment.Rmd to assign taxonomy from the sequence headers.
  3. ASVs.fasta: A fasta file combining the ASV names from ASV_counts.tsv with the sequences from ASV_counts_withSeqNames.tsv, which we can use to build a tree for phylogenetic analyses (e.g., phylogenetic alpha diversity metrics or UniFrac dissimilarity).
  4. We will also make a copy of ASVs.fasta in data/02_TaxAss_FreshTrain/ to be used for the taxonomy classification in the next step in the workflow.
  5. ASV_taxonomy.tsv: The taxonomy table, with one row per ASV.
  6. track_read_counts.RData: Tracks how many reads we lost throughout the workflow, which can be plotted later. We will add this to the metadata in analysis/02_Taxonomic_Assignment.Rmd.
# FIRST, we will save our output as regular files, which will be useful later on. 
# Save to regular .tsv file 
# Write BOTH the modified and unmodified ASV tables to a file!
# Write count table with ASV numbered names (e.g. ASV_1, ASV_2, etc)
write.table(asv_tab, "data/01_DADA2/ASV_counts.tsv", sep = "\t", quote = FALSE, col.names = NA)
# Write count table with ASV sequence names
write.table(noChimeras_ASV_table, "data/01_DADA2/ASV_counts_withSeqNames.tsv", sep = "\t", quote = FALSE, col.names = NA)
# Write out the fasta file for reference later on for what seq matches what ASV
asv_fasta <- c(rbind(asv_headers, asv_seqs))
# Save to a file!
write(asv_fasta, "data/01_DADA2/ASVs.fasta")
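The `c(rbind(asv_headers, asv_seqs))` idiom above is worth unpacking: `rbind()` stacks the headers and sequences as the two rows of a matrix, and `c()` then flattens that matrix column by column, so the result alternates header, sequence, header, sequence — exactly the fasta line order. A minimal sketch with toy vectors (the toy names are illustrative, not our real objects):

```r
# Toy headers and sequences standing in for asv_headers / asv_seqs
toy_headers <- c(">ASV_1", ">ASV_2")
toy_seqs    <- c("ACGT", "TTGA")

# rbind() builds a 2-row matrix; c() reads it column-wise,
# interleaving the two vectors into fasta order
toy_fasta <- c(rbind(toy_headers, toy_seqs))
toy_fasta
## [1] ">ASV_1" "ACGT"   ">ASV_2" "TTGA"
```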


# SECOND, let's save the taxonomy tables 
# Write the table 
write.table(asv_tax, "data/01_DADA2/ASV_taxonomy.tsv", sep = "\t", quote = FALSE, col.names = NA)
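One caveat about `col.names = NA`: it writes a blank cell above the row-name column, so when these .tsv files are read back in, `row.names = 1` is needed to restore the ASV names as row names. A quick round-trip sketch with a toy table and a temporary file:

```r
# Toy taxonomy-like table with ASV row names
toy_tax <- data.frame(Kingdom = c("Bacteria", "Bacteria"),
                      Phylum  = c("Cyanobacteria", "Proteobacteria"),
                      row.names = c("ASV_1", "ASV_2"))
tmp_tsv <- tempfile(fileext = ".tsv")

# col.names = NA leaves an empty header cell over the row names
write.table(toy_tax, tmp_tsv, sep = "\t", quote = FALSE, col.names = NA)

# row.names = 1 moves that first column back into row names
toy_back <- read.table(tmp_tsv, sep = "\t", header = TRUE, row.names = 1)
identical(rownames(toy_back), c("ASV_1", "ASV_2"))
## [1] TRUE
```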


# THIRD, let's save to a RData object 
# Each of these files will be used in the analysis/02_Taxonomic_Assignment
# RData objects are for easy loading :) 
save(noChimeras_ASV_table, file = "data/01_DADA2/noChimeras_ASV_table.RData")
save(asv_tab, file = "data/01_DADA2/ASV_counts.RData")
# And save the track_counts_df as an R object, which we will merge with the metadata in the next step of the analysis in analysis/02_Taxonomic_Assignment. 
save(track_counts_df, file = "data/01_DADA2/track_read_counts.RData")
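As a reminder of how the .RData files behave downstream: `load()` restores each object under the exact name it was saved with, directly into the calling environment. A self-contained sketch with a temporary file:

```r
# save() stores the object together with its name ...
demo_tab <- matrix(1:4, nrow = 2)
tmp_rdata <- tempfile(fileext = ".RData")
save(demo_tab, file = tmp_rdata)

# ... so after removing it, load() brings it back under the same name
rm(demo_tab)
load(tmp_rdata)
exists("demo_tab")
## [1] TRUE
```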

Check Render Time

# Take the time now that we are at the end of the script
end_time <- Sys.time()
end_time 
## [1] "2024-03-09 14:51:31 EST"
# Echo the elapsed time
elapsed_time <- round((end_time - start_time), 3)
elapsed_time
## Time difference of 20.389 mins
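One caveat on the subtraction above: `end_time - start_time` returns a `difftime` whose units R picks automatically (seconds for short runs, minutes here), so code that stores or compares run times is safer requesting an explicit unit via `difftime()`. A small sketch using the whole-second timestamps printed in this document:

```r
# Recreate the start/end timestamps from this render
t_start <- as.POSIXct("2024-03-09 14:31:08", tz = "America/New_York")
t_end   <- as.POSIXct("2024-03-09 14:51:31", tz = "America/New_York")

# Request minutes explicitly instead of relying on auto-chosen units
elapsed_mins <- as.numeric(difftime(t_end, t_start, units = "mins"))
round(elapsed_mins, 3)
## [1] 20.383
```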

Session Information

# Ensure reproducibility 
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.2 (2023-10-31)
##  os       Rocky Linux 9.0 (Blue Onyx)
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/New_York
##  date     2024-03-09
##  pandoc   3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package              * version   date (UTC) lib source
##  abind                  1.4-5     2016-07-21 [2] CRAN (R 4.3.2)
##  ade4                   1.7-22    2023-02-06 [1] CRAN (R 4.3.2)
##  ape                    5.7-1     2023-03-13 [2] CRAN (R 4.3.2)
##  Biobase                2.62.0    2023-10-24 [2] Bioconductor
##  BiocGenerics           0.48.1    2023-11-01 [2] Bioconductor
##  BiocManager          * 1.30.22   2023-08-08 [1] CRAN (R 4.3.2)
##  BiocParallel           1.36.0    2023-10-24 [2] Bioconductor
##  biomformat             1.30.0    2023-10-24 [1] Bioconductor
##  Biostrings             2.70.2    2024-01-28 [1] Bioconductor 3.18 (R 4.3.2)
##  bitops                 1.0-7     2021-04-24 [2] CRAN (R 4.3.2)
##  bslib                  0.5.1     2023-08-11 [2] CRAN (R 4.3.2)
##  cachem                 1.0.8     2023-05-01 [2] CRAN (R 4.3.2)
##  cli                    3.6.2     2023-12-11 [1] CRAN (R 4.3.2)
##  cluster                2.1.4     2022-08-22 [2] CRAN (R 4.3.2)
##  codetools              0.2-19    2023-02-01 [2] CRAN (R 4.3.2)
##  colorspace             2.1-0     2023-01-23 [2] CRAN (R 4.3.2)
##  crayon                 1.5.2     2022-09-29 [2] CRAN (R 4.3.2)
##  crosstalk              1.2.0     2021-11-04 [2] CRAN (R 4.3.2)
##  dada2                * 1.16.0    2024-03-06 [1] Github (benjjneb/dada2@8bedfb4)
##  data.table             1.14.8    2023-02-17 [2] CRAN (R 4.3.2)
##  DelayedArray           0.28.0    2023-10-24 [2] Bioconductor
##  deldir                 2.0-4     2024-02-28 [1] CRAN (R 4.3.2)
##  devtools             * 2.4.5     2022-10-11 [1] CRAN (R 4.3.2)
##  digest                 0.6.34    2024-01-11 [1] CRAN (R 4.3.2)
##  dplyr                * 1.1.4     2023-11-17 [1] CRAN (R 4.3.2)
##  DT                   * 0.32      2024-02-19 [1] CRAN (R 4.3.2)
##  ellipsis               0.3.2     2021-04-29 [2] CRAN (R 4.3.2)
##  evaluate               0.23      2023-11-01 [2] CRAN (R 4.3.2)
##  fansi                  1.0.6     2023-12-08 [1] CRAN (R 4.3.2)
##  farver                 2.1.1     2022-07-06 [2] CRAN (R 4.3.2)
##  fastmap                1.1.1     2023-02-24 [2] CRAN (R 4.3.2)
##  forcats              * 1.0.0     2023-01-29 [1] CRAN (R 4.3.2)
##  foreach                1.5.2     2022-02-02 [2] CRAN (R 4.3.2)
##  fs                     1.6.3     2023-07-20 [2] CRAN (R 4.3.2)
##  generics               0.1.3     2022-07-05 [2] CRAN (R 4.3.2)
##  GenomeInfoDb           1.38.6    2024-02-08 [1] Bioconductor 3.18 (R 4.3.2)
##  GenomeInfoDbData       1.2.11    2023-11-07 [2] Bioconductor
##  GenomicAlignments      1.38.2    2024-01-16 [1] Bioconductor 3.18 (R 4.3.2)
##  GenomicRanges          1.54.1    2023-10-29 [2] Bioconductor
##  ggplot2              * 3.5.0     2024-02-23 [2] CRAN (R 4.3.2)
##  glue                   1.7.0     2024-01-09 [1] CRAN (R 4.3.2)
##  gtable                 0.3.4     2023-08-21 [2] CRAN (R 4.3.2)
##  highr                  0.10      2022-12-22 [2] CRAN (R 4.3.2)
##  hms                    1.1.3     2023-03-21 [1] CRAN (R 4.3.2)
##  htmltools              0.5.7     2023-11-03 [2] CRAN (R 4.3.2)
##  htmlwidgets            1.6.2     2023-03-17 [2] CRAN (R 4.3.2)
##  httpuv                 1.6.12    2023-10-23 [2] CRAN (R 4.3.2)
##  hwriter                1.3.2.1   2022-04-08 [1] CRAN (R 4.3.2)
##  igraph                 1.5.1     2023-08-10 [2] CRAN (R 4.3.2)
##  iNEXT                * 3.0.0     2022-08-29 [1] CRAN (R 4.3.2)
##  interp                 1.1-6     2024-01-26 [1] CRAN (R 4.3.2)
##  IRanges                2.36.0    2023-10-24 [2] Bioconductor
##  iterators              1.0.14    2022-02-05 [2] CRAN (R 4.3.2)
##  jpeg                   0.1-10    2022-11-29 [1] CRAN (R 4.3.2)
##  jquerylib              0.1.4     2021-04-26 [2] CRAN (R 4.3.2)
##  jsonlite               1.8.8     2023-12-04 [1] CRAN (R 4.3.2)
##  knitr                  1.45      2023-10-30 [2] CRAN (R 4.3.2)
##  labeling               0.4.3     2023-08-29 [2] CRAN (R 4.3.2)
##  later                  1.3.1     2023-05-02 [2] CRAN (R 4.3.2)
##  lattice              * 0.21-9    2023-10-01 [2] CRAN (R 4.3.2)
##  latticeExtra           0.6-30    2022-07-04 [1] CRAN (R 4.3.2)
##  lifecycle              1.0.4     2023-11-07 [1] CRAN (R 4.3.2)
##  lubridate            * 1.9.3     2023-09-27 [1] CRAN (R 4.3.2)
##  magrittr               2.0.3     2022-03-30 [2] CRAN (R 4.3.2)
##  MASS                   7.3-60    2023-05-04 [2] CRAN (R 4.3.2)
##  Matrix                 1.6-1.1   2023-09-18 [2] CRAN (R 4.3.2)
##  MatrixGenerics         1.14.0    2023-10-24 [2] Bioconductor
##  matrixStats            1.2.0     2023-12-11 [1] CRAN (R 4.3.2)
##  memoise                2.0.1     2021-11-26 [2] CRAN (R 4.3.2)
##  mgcv                   1.9-0     2023-07-11 [2] CRAN (R 4.3.2)
##  mime                   0.12      2021-09-28 [2] CRAN (R 4.3.2)
##  miniUI                 0.1.1.1   2018-05-18 [2] CRAN (R 4.3.2)
##  multtest               2.58.0    2023-10-24 [1] Bioconductor
##  munsell                0.5.0     2018-06-12 [2] CRAN (R 4.3.2)
##  nlme                   3.1-163   2023-08-09 [2] CRAN (R 4.3.2)
##  pacman                 0.5.1     2019-03-11 [1] CRAN (R 4.3.2)
##  patchwork            * 1.2.0     2024-01-08 [1] CRAN (R 4.3.2)
##  permute              * 0.9-7     2022-01-27 [1] CRAN (R 4.3.2)
##  phyloseq             * 1.46.0    2023-10-24 [1] Bioconductor
##  pillar                 1.9.0     2023-03-22 [2] CRAN (R 4.3.2)
##  pkgbuild               1.4.3     2023-12-10 [1] CRAN (R 4.3.2)
##  pkgconfig              2.0.3     2019-09-22 [2] CRAN (R 4.3.2)
##  pkgload                1.3.4     2024-01-16 [1] CRAN (R 4.3.2)
##  plyr                   1.8.9     2023-10-02 [2] CRAN (R 4.3.2)
##  png                    0.1-8     2022-11-29 [2] CRAN (R 4.3.2)
##  profvis                0.3.8     2023-05-02 [2] CRAN (R 4.3.2)
##  promises               1.2.1     2023-08-10 [2] CRAN (R 4.3.2)
##  purrr                * 1.0.2     2023-08-10 [2] CRAN (R 4.3.2)
##  R6                     2.5.1     2021-08-19 [2] CRAN (R 4.3.2)
##  RColorBrewer           1.1-3     2022-04-03 [2] CRAN (R 4.3.2)
##  Rcpp                 * 1.0.12    2024-01-09 [1] CRAN (R 4.3.2)
##  RcppParallel           5.1.7     2023-02-27 [2] CRAN (R 4.3.2)
##  RCurl                  1.98-1.14 2024-01-09 [1] CRAN (R 4.3.2)
##  readr                * 2.1.5     2024-01-10 [1] CRAN (R 4.3.2)
##  remotes                2.4.2.1   2023-07-18 [2] CRAN (R 4.3.2)
##  reshape2               1.4.4     2020-04-09 [2] CRAN (R 4.3.2)
##  rhdf5                  2.46.1    2023-11-29 [1] Bioconductor 3.18 (R 4.3.2)
##  rhdf5filters           1.14.1    2023-11-06 [1] Bioconductor
##  Rhdf5lib               1.24.2    2024-02-07 [1] Bioconductor 3.18 (R 4.3.2)
##  rlang                  1.1.3     2024-01-10 [1] CRAN (R 4.3.2)
##  rmarkdown              2.25      2023-09-18 [2] CRAN (R 4.3.2)
##  Rsamtools              2.18.0    2023-10-24 [2] Bioconductor
##  rstudioapi             0.15.0    2023-07-07 [2] CRAN (R 4.3.2)
##  S4Arrays               1.2.1     2024-03-04 [1] Bioconductor 3.18 (R 4.3.2)
##  S4Vectors              0.40.2    2023-11-23 [1] Bioconductor 3.18 (R 4.3.2)
##  sass                   0.4.7     2023-07-15 [2] CRAN (R 4.3.2)
##  scales                 1.3.0     2023-11-28 [2] CRAN (R 4.3.2)
##  sessioninfo            1.2.2     2021-12-06 [2] CRAN (R 4.3.2)
##  shiny                  1.7.5.1   2023-10-14 [2] CRAN (R 4.3.2)
##  ShortRead              1.60.0    2023-10-24 [1] Bioconductor
##  SparseArray            1.2.4     2024-02-11 [1] Bioconductor 3.18 (R 4.3.2)
##  stringi                1.8.3     2023-12-11 [1] CRAN (R 4.3.2)
##  stringr              * 1.5.1     2023-11-14 [1] CRAN (R 4.3.2)
##  SummarizedExperiment   1.32.0    2023-10-24 [2] Bioconductor
##  survival               3.5-7     2023-08-14 [2] CRAN (R 4.3.2)
##  tibble               * 3.2.1     2023-03-20 [2] CRAN (R 4.3.2)
##  tidyr                * 1.3.1     2024-01-24 [1] CRAN (R 4.3.2)
##  tidyselect             1.2.0     2022-10-10 [2] CRAN (R 4.3.2)
##  tidyverse            * 2.0.0     2023-02-22 [1] CRAN (R 4.3.2)
##  timechange             0.3.0     2024-01-18 [1] CRAN (R 4.3.2)
##  tzdb                   0.4.0     2023-05-12 [1] CRAN (R 4.3.2)
##  urlchecker             1.0.1     2021-11-30 [2] CRAN (R 4.3.2)
##  usethis              * 2.2.2     2023-07-06 [2] CRAN (R 4.3.2)
##  utf8                   1.2.4     2023-10-22 [2] CRAN (R 4.3.2)
##  vctrs                  0.6.5     2023-12-01 [1] CRAN (R 4.3.2)
##  vegan                * 2.6-4     2022-10-11 [1] CRAN (R 4.3.2)
##  withr                  3.0.0     2024-01-16 [1] CRAN (R 4.3.2)
##  xfun                   0.41      2023-11-01 [2] CRAN (R 4.3.2)
##  xtable                 1.8-4     2019-04-21 [2] CRAN (R 4.3.2)
##  XVector                0.42.0    2023-10-24 [2] Bioconductor
##  yaml                   2.3.7     2023-01-23 [2] CRAN (R 4.3.2)
##  zlibbioc               1.48.0    2023-10-24 [2] Bioconductor
## 
##  [1] /home/mls528/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /programs/R-4.3.2/library
## 
## ──────────────────────────────────────────────────────────────────────────────